Psychometric properties of outcome measurement instruments for ANCA-associated vasculitis: a systematic literature review

Abstract
Objectives To systematically review the psychometric properties of outcome measurement instruments used in ANCA-associated vasculitis (AAV).
Methods Medline, EMBASE, Cochrane, Scopus and Web of Science were searched from inception to 14 July 2020 for validation studies of instruments used in AAV. Following the COnsensus-based Standards for the selection of health status Measurement INstruments (COSMIN) and OMERACT frameworks, different psychometric properties (validity, reliability, responsiveness and feasibility) were summarized. Risk of bias was assessed according to the COSMIN checklist.
Results From 2505 articles identified, 32 met the predefined selection criteria, providing information on 22 instruments assessing disease activity (n = 7), damage (n = 2), activity and damage (n = 1), health-related quality of life (HRQoL; n = 9) and function (n = 3). Most of the instruments were tested in AAV as a group or in granulomatosis with polyangiitis only. The BVAS (any version), the Vasculitis Damage Index (VDI) and the AAV-Patient-Reported Outcome (AAV-PRO) have been more extensively validated than the other instruments. BVAS for Wegener Granulomatosis (BVAS/WG) has been shown to be valid for measuring disease activity [correlation with Physician Global Assessment (r = 0.90)], reliable (inter-observer intraclass correlation coefficient = 0.97), responsive and feasible. For damage, VDI was shown to be moderately valid (correlations with BVAS version 3 at 6 months r = 0.14, BVAS/WG at 1 year r = 0.40 and 5 years r = 0.20), and feasible. For HRQoL, AAV-PRO demonstrated validity (correlations of the six AAV-PRO domains with EQ-5D-5L: −0.78 to −0.55; discrimination between active disease and remission, P < 0.0001 for all comparisons). The overall performance of instruments assessing function was low-to-moderate.
Conclusion Among the 22 outcome measurement instruments used for AAV, BVAS (any version), VDI and AAV-PRO had the strongest psychometric properties.


Introduction
ANCA-associated vasculitis (AAV) encompasses three major systemic clinical conditions caused by inflammation of the small blood vessels: granulomatosis with polyangiitis (GPA), eosinophilic granulomatosis with polyangiitis (EGPA) and microscopic polyangiitis (MPA, which includes the renal-limited form) [1]. GPA and MPA are characterized by heterogeneous manifestations and a great deal of clinical overlap between the two diseases. However, GPA has a greater predilection for the upper and lower respiratory tracts (with characteristic destructive lesions in the nasal septum, lung nodules and cavities), and MPA more frequently involves glomerulonephritis [1]. Asthma, nasal polyps and peripheral hyper-eosinophilia are unique features of EGPA, which represents approximately 10-20% of patients with AAV, and has been treated as a separate clinical entity from GPA and MPA in clinical trials [1]. AAV often has a major impact on patients' lives through both acute illness and over the long term, affecting several major organs and threatening life [2].
To measure disease severity and response to treatment, and enable comparability across studies in AAV, defining standardized outcome measures is of utmost importance. This need has been well recognized by the OMERACT Vasculitis Working Group: a core set of domains and associated outcome measures have been endorsed to be used for AAV clinical trials [3,4], i.e. disease activity, damage assessment, patient-reported outcomes (PRO) and mortality. Since the publication of the OMERACT core set for AAV, a substantial amount of additional research has been conducted on the performance of outcome measure instruments in AAV that assess various domains. For every instrument assessing the domains identified by OMERACT for AAV, the characteristics of the single instrument, such as the extent to which an instrument measures what it asserts to measure (i.e. validity), or the instrument's ability to produce stable and consistent results (i.e. reliability), are collectively called psychometric properties.
OMERACT uses a staged process to establish core sets by first establishing the key domains of illness, and then identifying validated instruments to assess the domains [4], which is the result of a consensus expert opinion that did not rely on a systematic literature review of the available instruments used in AAV [5]. Systematic reviews of clinical trials and observational studies help catalogue outcome measures used and domains targeted for the disease of interest, inform groups to work towards agreement on relevant domains of illness and summarize the psychometric properties of instruments measuring each domain. Recently, a systematic review on the use and reporting of outcome measures in randomized clinical trials of AAV showed that a large degree of heterogeneity exists among instruments used in endpoint definitions and timing of assessments [6]. Therefore, to make informed choices of instruments to use to measure each domain, it would be useful to know the instruments' psychometric properties.
The EULAR Outcomes Measures Library (OML) is an international collaborative initiative that provides an open-access repository of outcome measures in rheumatology [7] and uses the COnsensus-based Standards for the selection of health status Measurement INstruments (COSMIN) checklist to appraise the instruments [8]. One approach to populating the OML is to conduct systematic reviews of existing instruments for any given disease or domain and to appraise the instruments' psychometric properties.
Based on the interest of the vasculitis community of patients, clinicians and investigators in appraising the existing psychometric properties of instruments used for AAV, a systematic review was designed in collaboration between the OMERACT Vasculitis Working Group and the EULAR OML. We have reviewed and summarized the current evidence on psychometric properties of outcome measurement instruments used in AAV, covering each core domain as defined by OMERACT.

Methods
encompassing various systemic vasculitis but not presenting data for AAV separately were excluded (Supplementary Material S2, available at Rheumatology online).
The search strategy was designed and conducted by an experienced librarian (L.C.H.) with input from the study's principal investigators. No limits on publication language or dates were imposed. Two reviewers (A.B., G.B.) independently screened titles and abstracts, followed by full-text review of selected articles. Data extraction from papers was also performed independently by two investigators (A.B., G.B.). In case of disagreement, senior reviewers (S.R. and L.C.) helped to reach consensus.

Data extraction
Data concerning study and instrument description and validation were collected (further details in Supplementary Material S4, available at Rheumatology online).

Risk of bias assessment
Risk of bias was assessed according to the COSMIN checklist, a checklist that can be used at the level of the individual studies and of the instruments [8]. The studies were evaluated rating each property, when present, from 'inadequate' to 'very good' (not available, inadequate, doubtful, adequate, very good). The final risk of bias for each study was evaluated as 'low', 'moderate' or 'high' based on the evaluation of all the properties.

Search results and study features
From 2505 references identified in the search, 156 were reviewed in full text and 32 met the predefined selection criteria and were included in the final analyses (supplementary Fig. S1, available at Rheumatology online). The characteristics of the included studies, including instruments used, the OMERACT domains assessed, and risk of bias are reported in Table 1. Five studies focussed on the development of instruments [11-15], 24 were validation studies and 3 pursued both objectives [40-42]. All studies involved an adult (≥18 years old) population except for one that focussed on a paediatric population with AAV [20]. The baseline characteristics of the study populations varied across the different studies, in terms of AAV subsets assessed, distribution by age and sex, sample size and country (supplementary Table S1, available at Rheumatology online).

Risk of bias
The following studies had potential high risk of bias according to the COSMIN checklist: validation of the Routine Assessment of Patient Index Data 3 (RAPID3) [16], the multivariable index for AAV (MVIA) [12], the Ear, Nose and Throat (ENT)/GPA Disease Activity Score (ENT/GPA DAS) [11] and the Overall Disability Sum Score (ODSS) [26] (Table 1 for further details).

Overview of the instruments' psychometric properties assessed
The studies identified provided information on 22 instruments, 7 assessing disease activity, 2 assessing disease damage, 1 assessing both disease activity and damage, 9 assessing patient-reported outcomes and 3 assessing function.
Disease activity was assessed with BVAS and its revisions, the most widely accepted numeric scores for the assessment of disease-specific activity for AAV [20,41,42]; by ENT/GPA DAS, proposed for the assessment of disease activity in patients with otorhinolaryngological manifestations of GPA [11]; by Disease Extent Index (DEI), a validated instrument to quantitatively assess disease extent and activity in patients with AAV [17]; and by MVIA and Vasculitis Activity Index (VAI), the first designed to estimate activity at diagnosis (and to predict all-cause mortality) in patients with AAV [12] and the second to incorporate appropriately weighted clinical measurements reflecting disease activity in systemic necrotizing vasculitis [15].
Disease damage was assessed with the Vasculitis Damage Index (VDI), a validated and widely used method for measuring damage sustained from vasculitis or its treatment [21], and the Combined Damage Assessment Index (CDA), an instrument derived from the VDI that includes additional items of damage not captured by individual items on the VDI [32]. The ENT assessment score (ENTAS) and its newer version, ENTAS 2, were both developed for a structured, reliable ENT assessment in patients with GPA, to evaluate disease activity and both disease activity and damage, respectively [18,19].
Health-related quality of life (HRQoL) was assessed by AAV-specific instruments, i.e. AAV-Patient-Reported Outcome (AAV-PRO) [13,40], and Vasculitis Self-Management Scale (VSMS) [14], a measure of illness self-management for adults living with AAV; and by nonspecific instruments, i.e. Patient-Reported Outcome Measurement Information System (PROMIS), a 10-item collection of self-reported health completed by vasculitis patients in 40-55 s [38]; Medical Outcomes Study Short-Form 36 (SF-36) [27,43], a set of generic, coherent and easily administered quality-of-life measures; Multidimensional Fatigue Inventory-20 (MFI-20) [28], a 20-item scale designed to evaluate five dimensions of fatigue, i.e. general fatigue, physical fatigue, reduced motivation, reduced activity and mental fatigue; Patient Global Assessment (PtGA) assessed as 100-mm visual analogue scales [44]; Brief Illness Perception Questionnaire (BIPQ) [29], a nine-item scale designed to rapidly assess the cognitive and emotional representations of illness; the revised Illness Perception Questionnaire (IPQ-R) [30], a recently developed revised version of the IPQ measuring the five coherent components that together make up the patient's perception of their illness; and RAPID3 [16], an index of patient-reported measures completed by patients and calculated by a health professional in 5 s. Function was assessed with non-AAV-specific instruments: HAQ [25,45], a self-reported measure of functional status (disability) used in many diseases; overall disability sum score (ODSS) [26], an instrument used for disability in immune-mediated polyneuropathies; and Composite Autonomic Symptom Score 31 (COMPASS31) [24], a generic instrument to assess autonomic symptoms across multiple domains.
The psychometric properties of the 22 instruments are summarized in Fig. 1. There was a wide heterogeneity in the psychometric properties assessed for each instrument. A few psychometric properties have been considered in each study, with validity being the most frequently assessed aspect, in 82% of the instruments, but few properties other than construct validity were reported. Overall, the BVAS for disease activity, the VDI for damage, and the AAV-PRO for HRQoL/PRO, were the instruments with the best performance within the psychometric properties assessed (Fig. 1).

Instruments tested in different AAV subsets
Most of the instruments were tested in AAV as a group (i.e. including GPA, MPA and EGPA) or GPA only, followed by MPA and GPA, MPA only and EGPA only. Fig. 2 represents the instruments tested in the different AAV subsets by the OMERACT domain assessed. Among others, the BVAS version 3 (BVAS.v3) was validated in all AAV, while BVAS for Wegener Granulomatosis (BVAS/WG) in GPA and MPA, and the DEI in GPA only. Specific studies aiming to validate BVAS and VDI have not yet been performed in EGPA only. AAV-PRO was developed and validated in all forms of AAV, while the ODSS has been validated in EGPA only.

Validity
Validity was analysed very differently across studies, making comparisons more difficult (Table 2). For disease activity, BVAS/WG demonstrated the most adequate construct validity, having the highest correlation with Physician Global Assessment (PhGA; r = 0.92 and r = 0.90 in the weighted version). The DEI and BVAS.v3 followed the BVAS/WG but did not have the same level of construct validity (for example, BVAS.v3 had lower correlation with PhGA, r = 0.38, compared with the BVAS/WG). The MVIA has weak construct validity (correlation with BVAS r = 0.37). The VAI for disease activity and the CDA for damage were shown to differ significantly across different types of vasculitis, but besides discriminant validity, no other psychometric measures have been tested in patients with AAV for these two instruments.
For damage, the VDI was shown to have some construct validity, but its adequacy was low (correlations with BVAS.v3 at 6 months r = 0.14, BVAS/WG at 1 year r = 0.40 and 5 years r = 0.20). AAV-PRO is a PRO specifically developed to assess HRQoL in patients with AAV. AAV-PRO had the best performance for validity (construct validity: correlations of the six AAV-PRO domains with EQ-5D-5L: −0.78 to −0.55; discriminant validity: discrimination between active disease vs remission, P < 0.0001 for all comparisons; and high face validity and content validity: Smith's Salience Index was used to identify the most salient items). The validity was overall moderate for the other AAV-specific instrument (VSMS, since it was possible to extrapolate data specific for AAV only for discriminant validity and not for construct validity, one of the aims of the study) and for several instruments not specific for AAV assessing HRQoL/PRO (PROMIS, SF-36, MFI-20, PtGA, BIPQ, IPQ-R and RAPID3) and function (HAQ, ODSS and COMPASS31).
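The construct-validity figures above are correlation coefficients between an instrument's score and a comparator. As a minimal sketch, assuming these are Pearson correlations (the included studies may have used rank-based coefficients instead), the calculation can be illustrated as follows. The function name and the scores are hypothetical and are not data from any included study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical paired scores for six patients (illustration only):
# an activity index and a physician global assessment.
activity_index = [2, 5, 9, 12, 3, 7]
physician_global = [1, 4, 8, 9, 2, 6]
r = pearson_r(activity_index, physician_global)
print(f"r = {r:.2f}")
```

A coefficient close to 1 (or to −1 when the two scales run in opposite directions, as for AAV-PRO vs EQ-5D-5L) indicates that the instrument tracks its comparator closely.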

Reliability
For disease activity, BVAS/WG was shown to have the highest intraclass correlation coefficient (ICC) (ICC = 0.97), followed by DEI (ICC = 0.96), while for function ODSS was shown to have the highest ICC (ICC = 0.96). ENTAS and ENTAS 2 each have moderate inter- and intra-observer reliability, while the instrument domains of AAV-PRO and VSMS have intra-observer reliability ICCs ranging between 0.89 and 0.96, and between 0.51 and 0.76, respectively (Table 3). Reliability has not been assessed for the VDI. Internal consistency was demonstrated for both the AAV-PRO and the HAQ (Cronbach's alphas 0.77-0.96 and 0.91-0.93, respectively).
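For readers less familiar with the two reliability statistics reported here, the following is a minimal sketch of how Cronbach's alpha and an ICC can be computed. The ICC form shown is the two-way random-effects, absolute-agreement, single-rater coefficient, ICC(2,1); this choice is an assumption, since the form used in each included study is not stated in this review. Function names and layouts are illustrative.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha. `items` is a list of per-item score lists,
    each holding one score per respondent (same respondent order)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent totals
    sum_item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))


def icc_2_1(ratings):
    """ICC(2,1). `ratings` is a list of rows, one row per subject and
    one column per rater, from a two-way ANOVA without replication."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(col) / n for col in zip(*ratings)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)                      # between-subjects
    ms_cols = ss_cols / (k - 1)                      # between-raters
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

Because ICC(2,1) penalizes systematic differences between raters, it is typically lower than a consistency-type ICC computed on the same ratings, which is one reason ICCs from different studies are not always directly comparable.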

Responsiveness
BVAS.v3, BVAS/WG, VDI and SF-36 have been shown to be sensitive to change in randomized controlled trials of AAV (Table 4). DEI has a mean standardized response of 2.37 S.D. units, while in AAV-PRO responsiveness was moderate (effect sizes ranging from 0.0 to 0.09 for 'no change' vs from 0.21 to 0.28 for 'much better'), likely limited by the short time interval of 3 months in patients who were in remission in 70% of cases (and therefore were not expected to change in clinical state, as in the context of a clinical trial). ODSS has been shown to change moderately during follow-up from baseline [baseline to 6 months (4.2 ± 2.4 to 2.9 ± 1.5, P = 0.0001)].
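Responsiveness statistics of the kind quoted above are usually ratios of mean change to a standard deviation. As a minimal sketch, assuming the common definitions (standardized response mean = mean change divided by the S.D. of the change scores; effect size = mean change divided by the S.D. of the baseline scores), which may differ from the exact definitions used in individual studies:

```python
from statistics import mean, stdev


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the S.D. of the individual change scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


def effect_size(baseline, follow_up):
    """Mean change divided by the S.D. of the baseline scores."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


# Hypothetical scores for four patients (illustration only)
baseline = [10, 12, 14, 16]
follow_up = [6, 9, 9, 12]
srm = standardized_response_mean(baseline, follow_up)
es = effect_size(baseline, follow_up)
```

The sign reflects the direction of change (negative here because scores fell), and the two statistics can differ substantially for the same data, so values are comparable only across studies that used the same definition.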

Feasibility
The majority of instruments were shown to be feasible, except for two AAV-specific instruments, the ENTAS and the ENTAS 2, due to complexity of the instrument, time needed, necessity of training and raters limited to ENT specialists; and two non-AAV specific instruments, the ODSS and the COMPASS31, due to the complexity of the instruments and necessity for training.

Discussion
This is the first systematic review summarizing the psychometric properties of outcome measurement instruments developed or validated for AAV. Twenty-two instruments covering the OMERACT domains of disease activity, damage, QoL/PRO and function have had their psychometric properties assessed. The domains identified in this systematic review are endorsed by OMERACT as the core set outcomes for randomized controlled trials of AAV [3,5,38,46]. The majority of instruments were developed or validated in AAV as a group or in GPA only, while only one instrument was specifically validated in patients with EGPA [26]. All instruments but one [20] were validated in an adult population. Overall, the instruments with strongest psychometric properties were the BVAS (all versions) for disease activity, the VDI for damage and the AAV-PRO for PRO/quality of life [20-22, 38, 40-42].
This systematic approach showed that the instruments developed or validated for vasculitis in general (such as the BVAS or VDI) or specifically for AAV (such as AAV-PRO), were those that performed the best [20-22, 38, 40-42], while the non-vasculitis or non-AAV-specific instruments performed on average worse, suggesting that active research in vasculitis is necessary to develop instruments optimally measuring disease domain(s) specific for AAV. The best example is the AAV-PRO, which, compared with the other non-vasculitis-specific HRQoL instruments, has high levels of almost all the properties assessed, i.e. validity, reliability, responsiveness and feasibility, while the performance of SF-36, MFI-20 and PtGA ranged from low to moderate in the properties assessed [28,43,44]. However, AAV-PRO has not been used or validated in a clinical trial. In addition, these findings indirectly confirmed the expert-based opinion of the OMERACT group [4], which is reassuring.

Assessment of disease activity
For disease activity, the BVAS/WG [41] performed better than BVAS.v3 [20], but the difference might lie in the fact that the first, specifically designed for GPA, was validated in patients with GPA, while the second was validated in all AAV. DEI showed adequate validity with good correlation with BVAS during active disease, and non-significant correlation with BVAS during disease remission [17]. The DEI aims to document organ involvement typically attributable to active vasculitis, which is linked to disease activity, while the BVAS measures disease activity, with the two measures providing complementary information. All these indices are highly correlated among themselves and, in the hands of experts, the instruments are highly reliable, as shown in an exercise comparing different AAV activity instruments [47]. ENT/GPA DAS, ENTAS and ENTAS 2, which assess organ-specific disease activity (i.e. the ENT domain) [11,18,19,48], overall performed poorly. Interestingly, no EGPA-specific disease activity instruments were identified by this systematic review. Since the psychometric features of BVAS have never been validated in EGPA only, and it is believed that this tool does not adequately capture the full range of manifestations of EGPA, an EGPA-specific instrument would likely be a useful advance for the field.
Surprisingly, no studies have been found specifically assessing the role of PhGA in AAV, probably the most widely used measure in clinical practice. For some instruments, such as the BVAS, the PhGA is the major comparator to be correlated with the BVAS [41,42]. In one study, PhGA among experts was collected in order to be compared with BVAS/WG, and inter- and intra-observer reliability were thus determined (ICC of 0.96 and 0.28, respectively). These data are in contrast to other rheumatic diseases such as lupus, in which PhGA has been shown to be strongly influenced by the clinical experience of the physician, therefore producing a wide inter-observer variability, challenging comparison across patients [49]. In AAV, baseline PtGA-PhGA discordance was inversely associated with newly diagnosed disease (odds ratio 0.37, 95% CI 0.20, 0.68) [44]; however, no paper focussing on the psychometric properties of PhGA in AAV has been retrieved by this search.

Assessment of disease damage
Not surprisingly, the VDI correlated weakly with disease activity and SF-36 [21]. As previously indicated by OMERACT, there are issues with content validity of the VDI as it may not detect all forms of damage incurred in AAV and items of damage not attributable to vasculitis are also recorded in the VDI, while surprisingly no reliability has been assessed. Therefore, research on other AAV-specific disease damage instruments is ongoing, as for ANCA-Vasculitis Index of Damage (AVID) (not yet validated) [50] and CDA [32], aiming to capture more items of damage than VDI.

Assessment of HRQoL and other PRO
There has been growing interest in the importance of integrating patients' perspectives on the impact of their disease, and quality of life has been proposed by OMERACT as a core domain to be assessed in clinical trials of AAV. Except for AAV-PRO, PRO or quality of life through different outcomes (fatigue, sleep, mental health, pain, physical and social functioning) was assessed using generic instruments not validated for AAV or for vasculitis in general. Among others, the SF-36 has been used; it covers eight domains, including physical and social functioning and mental health [51] and has been widely used in patients with inflammatory musculoskeletal disorders [52] and scarcely in large vessel vasculitis [53]. In this review, SF-36 performed worse than AAV-PRO, and was often used as a comparator for other domains (such as VDI) [21]. VSMS, the other AAV-specific instrument, which measures illness self-management, had a poor-to-moderate adequacy of validity [14].

Assessment of function
Function was assessed with non-AAV-specific instruments, and all these instruments overall have a low-to-moderate performance. In an attempt to update and expand the existing OMERACT expert-driven core set for AAV [3,5], several projects focussing on function in AAV have been evaluated [46]. As shown by this review, function remains an understudied domain in AAV, likely as a consequence of the numerous organs involved and the differences across the AAV subsets. Among others, ODSS [26], a validated score for immune-mediated polyneuropathies, has good inter-observer reliability, but only moderately adequate validity, even though it should be noted that ODSS has been retrospectively tested in a small population of EGPA patients (25 with peripheral neuropathy).

Limitations and strengths
Surprisingly, the number of studies developed for or specifically assessing psychometric properties of the instruments assessing the OMERACT domains was relatively small. This might be a consequence of the eligibility criteria, which excluded studies that did not provide the performance of the psychometric measures of the instruments in AAV separately, in which a broader 'vasculitis' population including patients without AAV was studied, leading to the exclusion of some seminal papers, e.g. the ones describing the validation of the BVAS and VDI [54-57], other large validation studies [43,58], or limiting the number of psychometric measures that can be extrapolated for AAV from the single studies [34,38,40-42]. Indeed, several studies, such as those assessing VAI, CDA, PROMIS, BVAS.v3, VDI, BIPQ and IPQ-R [14,15,29-32,54-56], assessed numerous psychometric measures of these instruments tested in populations of patients with various forms of vasculitis and not only AAV, but rarely performed subset analyses on AAV only, therefore limiting the available data specific for AAV. A strength of this study is that data were collected in a systematic literature review following state-of-the-art practices [8,10]. A limitation is heterogeneity, since a single psychometric property of different instruments can be assessed in different ways, limiting the head-to-head comparability of their performances. Consequently, the evaluation of psychometric properties can be assessed with different methods (e.g. to evaluate reliability, Cohen's κ, ICC), with no direct comparability. A certain degree of heterogeneity is expected, since some instruments provide a single numeric score (e.g. BVAS, PtGA), others provide a profile (AAV-PRO, SF-36), and most are multi-item, although single-item instruments exist (PtGA).

Conclusion
In conclusion, 22 instruments covering the OMERACT AAV core set domains of disease activity, damage, HRQoL/PRO and function had their psychometric properties assessed. Overall, the BVAS (any version), the VDI and the AAV-PRO were the instruments with the strongest psychometric properties. The majority of outcome instruments used for AAV were developed or validated for AAV as a group or GPA only, while specific studies for MPA or EGPA are lacking. The development and validation of outcome measurement instruments specific for AAV is warranted, since AAV-specific instruments are likely to capture a fuller range of disease manifestations yielding more precise measurements within the target disease, possibly assessing and reporting psychometric properties in a way that enables comparisons across instruments.